Dataframes

Quantitative Methodology (UPF)

Jordi Mas Elias

https://www.jordimas.cat/

Summary

  • What is a dataframe?
  • Observations
  • Variables
  • Recoding variables
  • Scope of data

What is a dataframe?

Table

It's a generic name. It can be almost anything.

  • Periodic table
  • Multiplication table
  • Truth table
  • Chi-squared table
  • Phonetic table

Data(s)

  • Source of information (SI): Raw empirical material.
  • Data (s/p): Collected, processed, systematized and organized SI (Van Evera 2009).
    • Numbers, characters, symbols … with no meaning on their own.
  • Database: An organized collection of data stored and accessed electronically / An organized collection of data stored as multiple datasets.
  • Dataset: A structured collection of data generally associated with a unique body of work.

Spreadsheet

How Excel stores data in two dimensions:

Dataframe

A way to store data in R in two dimensions, rows and columns:

# A tibble: 17,548 × 9
   scode country      year polity2 xrreg xrcomp xropen xconst parreg
   <chr> <chr>       <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 AFG   Afghanistan  1800      -6     3      1      1      1      3
 2 AFG   Afghanistan  1801      -6     3      1      1      1      3
 3 AFG   Afghanistan  1802      -6     3      1      1      1      3
 4 AFG   Afghanistan  1803      -6     3      1      1      1      3
 5 AFG   Afghanistan  1804      -6     3      1      1      1      3
 6 AFG   Afghanistan  1805      -6     3      1      1      1      3
 7 AFG   Afghanistan  1806      -6     3      1      1      1      3
 8 AFG   Afghanistan  1807      -6     3      1      1      1      3
 9 AFG   Afghanistan  1808      -6     3      1      1      1      3
10 AFG   Afghanistan  1809      -6     3      1      1      1      3
# … with 17,538 more rows
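As a minimal sketch of the same idea, the first three rows of the table above can be typed in by hand. This uses base R's `data.frame()` rather than a tibble, and the object name `polity` is only an assumption for illustration.

```r
# Three country-year rows copied from the table printed above
# (only four of the nine columns, to keep the example short).
polity <- data.frame(
  scode   = c("AFG", "AFG", "AFG"),
  country = c("Afghanistan", "Afghanistan", "Afghanistan"),
  year    = c(1800, 1801, 1802),
  polity2 = c(-6, -6, -6),
  stringsAsFactors = FALSE
)

dim(polity)          # number of rows, then number of columns
names(polity)        # column (variable) names
polity[2, "year"]    # one cell: row 2, column "year"
polity$polity2       # one whole column as a vector
```

Rows are addressed first and columns second, which is the two-dimensional structure described above.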

A tidy dataframe

We consider that a dataframe is tidy if it fulfills the following requirements (Wickham 2014):

  • Each dataframe has one unit of observation.
  • Observations are represented in the rows.
  • Variables are represented in the columns.
  • Each cell indicates a value.
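The requirements above can be illustrated with a small reshaping sketch. It assumes the tidyr package is installed (it is not among the packages loaded in these slides), and the object names `wide` and `long` are hypothetical. The values are the Afghanistan life-expectancy figures from the country-year table shown in the Observations section.

```r
library(tidyr)

# Untidy layout: the years are spread across the column names,
# so one row holds several observations.
wide <- data.frame(
  country = "Afghanistan",
  `1952`  = 28.8,
  `1957`  = 30.3,
  `1962`  = 32.0,
  check.names = FALSE   # keep the year-like column names as they are
)

# Tidy layout: one row per country-year, one column per variable.
long <- pivot_longer(
  wide,
  cols      = -country,
  names_to  = "year",
  values_to = "lifeExp"
)
long$year <- as.integer(long$year)   # column names arrive as text

long
```

After reshaping, each row is a single country-year observation and each cell holds a single value.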

RStudio workflow

Load packages: every time you start a new R session.

library(dplyr)
library(ggplot2)
library(readr)
library(stringr)
library(forcats)

Observations

Observing …

We need to decide what the units of interest are.

What is an observation?

  • Unit of analysis: The thing that we want to know about.
    • Determined by the hypothesis / question.
  • Unit of observation: Each row of a dataframe.
    • Determined by the instrument of measurement.
# A tibble: 8 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
7 Afghanistan Asia       1982    39.9 12881816      978.
8 Afghanistan Asia       1987    40.8 13867957      852.
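In the table above the unit of observation is the country-year. The sketch below, which assumes dplyr is installed and uses hypothetical object names, shows how aggregating the rows changes the unit of observation to the country.

```r
library(dplyr)

# Four country-year rows copied from the table above.
country_years <- data.frame(
  country = rep("Afghanistan", 4),
  year    = c(1952L, 1957L, 1962L, 1967L),
  lifeExp = c(28.8, 30.3, 32.0, 34.0)
)

# Collapse to one row per country: the unit of observation is now the country.
countries <- country_years %>%
  group_by(country) %>%
  summarise(
    n_years      = n(),
    mean_lifeExp = mean(lifeExp)
  )

nrow(country_years)   # one row per country-year
nrow(countries)       # one row per country
```

Whether the country-year or the country is the right unit depends on the question being asked, that is, on the unit of analysis.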

Example: Macro level

States, regions, legal systems …

# A tibble: 15 × 6
   WarName            WarType CcodeA SideA          CcodeB SideB                
   <chr>                <dbl>  <dbl> <chr>           <dbl> <chr>                
 1 First Caucasus           5    365 Russia             -8 Georgians, Dhagestan…
 2 Sidon-Damascus           6     -8 Sidon              -8 Damascus & Aleppo    
 3 First Two Sicilies       4    300 Austria            -8 -8                   
 4 First Two Sicilies       4    329 Two Sicilies       -8 Liberals             
 5 Spanish Royalists        4    230 Spain              -8 Royalists            
 6 Sardinian Revolt         4    300 Austria            -8 -8                   
 7 Sardinian Revolt         4    325 Sardinia           -8 Carbonari            
 8 Greek Independence       5    640 Ottoman Empire     -8 Greeks               
 9 Greek Independence       5     -8 -8                200 United Kingdom       
10 Greek Independence       5     -8 -8                220 France               
11 Greek Independence       5     -8 -8                365 Russia               
12 Egypt-Mehdi              6     -8 Egypt              -8 Mehdi army           
13 Janissari Revolt         4    640 Ottoman Empire     -8 Janissaries          
14 Miguelite War            4     -8 -8                200 United Kingdom       
15 Miguelite War            4    235 Portugal           -8 Constitutionalists   

Intra-State War Data (Correlates of War)

Example: Meso level

Organizations, ethnic groups, political parties …

# A tibble: 14 × 5
   countryname  year groupname statusname     groupsize
   <chr>       <dbl> <chr>     <chr>              <dbl>
 1 Belgium      1967 Flemings  JUNIOR PARTNER     0.59 
 2 Belgium      1967 Walloon   SENIOR PARTNER     0.4  
 3 Belgium      1967 Germans   IRRELEVANT         0.01 
 4 France       1967 French    MONOPOLY           0.976
 5 France       1967 Basques   POWERLESS          0.013
 6 France       1967 Corsicans POWERLESS          0.004
 7 France       1967 Roma      DISCRIMINATED      0.006
 8 Belgium      1968 Flemings  JUNIOR PARTNER     0.59 
 9 Belgium      1968 Walloon   SENIOR PARTNER     0.4  
10 Belgium      1968 Germans   IRRELEVANT         0.01 
11 France       1968 French    MONOPOLY           0.976
12 France       1968 Basques   POWERLESS          0.013
13 France       1968 Corsicans POWERLESS          0.004
14 France       1968 Roma      DISCRIMINATED      0.006

International Conflict Research

Example: Micro level

Families, individuals, relationships …

# A tibble: 1,599 × 5
   age   language       urban_rural region    electricity_nearby
   <chr> <chr>          <chr>       <chr>     <chr>             
 1 26    Igbo           Urban       IMO       Yes               
 2 25    Other          Rural       FCT ABUJA Yes               
 3 35    Hausa          Rural       FCT ABUJA Yes               
 4 79    Other          Rural       FCT ABUJA Yes               
 5 19    English        Rural       FCT ABUJA Yes               
 6 34    Igbo           Urban       IMO       Yes               
 7 30    Pidgin English Rural       FCT ABUJA Yes               
 8 32    Hausa          Rural       FCT ABUJA Yes               
 9 50    Other          Rural       FCT ABUJA Yes               
10 18    English        Rural       FCT ABUJA Yes               
# … with 1,589 more rows

Afrobarometer - Nigeria

Example: Events

Bombings, contracts, terrorist attacks…

# A tibble: 477 × 8
   cowcode region  year country    no  coup successful combat
     <dbl>  <dbl> <dbl> <chr>   <dbl> <dbl>      <dbl>  <dbl>
 1      40      5  1952 Cuba        1     1          1      1
 2      40      5  1957 Cuba        1     1          0      1
 3      41      5  1950 Haiti       1     1          1      0
 4      41      5  1956 Haiti       1     1          0      0
 5      41      5  1957 Haiti       1     1          1      0
 6      41      5  1957 Haiti       2     1          1      0
 7      41      5  1957 Haiti       3     1          1      0
 8      41      5  1958 Haiti       1     1          0      1
 9      41      5  1970 Haiti       1     1          0      0
10      41      5  1986 Haiti       1     1          1      0
# … with 467 more rows

Coup Agency and Mechanisms Dataset

Ecological fallacy

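When the unit of analysis (UA) and the unit of observation (UO) are not the same, we risk committing an ecological fallacy: drawing conclusions about one level (for example, individuals) from data observed at another (for example, groups).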

  • Admissions
# A tibble: 2 × 4
  sex   applicants admitted perc_admissions
  <chr>      <dbl>    <dbl>           <dbl>
1 Men         1331      691            51.9
2 Women       1093      394            36.0
  • Suicide
# A tibble: 2 × 4
  religion   population suicide  perc
  <chr>           <dbl>   <dbl> <dbl>
1 Catholic         2065     103  4.99
2 Protestant       8756     537  6.13
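The percentage columns in the two tables above can be reproduced from the counts. This is a base R sketch with hypothetical object names; the figures are the ones printed above.

```r
# Group-level counts copied from the two tables above.
admissions <- data.frame(
  sex        = c("Men", "Women"),
  applicants = c(1331, 1093),
  admitted   = c(691, 394)
)
admissions$perc_admissions <-
  round(100 * admissions$admitted / admissions$applicants, 1)

suicide <- data.frame(
  religion   = c("Catholic", "Protestant"),
  population = c(2065, 8756),
  suicide    = c(103, 537)
)
suicide$perc <- round(100 * suicide$suicide / suicide$population, 2)

admissions
suicide
```

Each row here is a group, not a person. The rates describe the groups as wholes, so on their own they do not show how any individual applicant or believer behaved.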

Ecological fallacy

Barcelona local elections: Neighbourhood level.

Ecological fallacy

Barcelona local elections: District level.

Ecological fallacy

Barcelona local elections: Census section level.

Variables

What is a variable?

A characteristic of the object we’re studying.

  • It varies across units.

Types of variables (I): Nominal

  • Municipality: Barcelona, Sant Cugat, Granollers…
  • Religion: Muslim, Catholic, Shinto…
  • Language: Russian, Catalan, Swedish…
  • Ideology: Conservatism, Nationalism, Liberalism…
  • Parties: PSOE, PP, Cs, ERC…

To work with character variables we can use the stringr package. This section does not explain every function in the package; the best reference is the corresponding cheatsheet on the RStudio website. Below we look at some of its functions, using the same strings dataframe that we created earlier.
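A few stringr functions in action, as a hedged sketch. The `strings` dataframe built here is a hypothetical stand-in for the one mentioned above: it reuses the party and municipality names from the list of nominal variables, and the pairing between them is arbitrary.

```r
library(stringr)

# Hypothetical stand-in for the `strings` dataframe.
strings <- data.frame(
  party        = c("PSOE", "PP", "Cs", "ERC"),
  municipality = c("Barcelona", "Sant Cugat", "Granollers", "Barcelona"),
  stringsAsFactors = FALSE
)

str_to_lower(strings$party)                   # change case
str_length(strings$municipality)              # number of characters
str_detect(strings$municipality, "^Sant")     # TRUE where the name starts with "Sant"
str_replace(strings$municipality, " ", "_")   # replace the first space, if any
```

All four functions take the character vector as their first argument, so they fit naturally inside `mutate()`.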

Types of variables (II): Ordinal

  • Things: Small, medium, large.
  • Age: Child, Young, Adult.
  • Education: Primary, Secondary, Tertiary.
  • Ideas: Agree, Neutral, Disagree.

Aid Transparency Index (ATI) 2022, Publish What You Fund. Economist Intelligence Unit 2022.

The best tool for manipulating factors is the forcats package. This section covers only some of its functions; to explore it further, see the corresponding cheatsheet on the RStudio website. Below we apply some functions to the ords dataframe that we created earlier. Most of these functions deal with reordering factors and are very useful when building plots.
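A short sketch of an ordered factor and two forcats reordering functions. The `ords` dataframe built here is a hypothetical stand-in for the one mentioned above, using the education categories from the list of ordinal variables.

```r
library(forcats)

# Hypothetical stand-in for the `ords` dataframe.
ords <- data.frame(
  education = c("Secondary", "Primary", "Tertiary", "Secondary"),
  stringsAsFactors = FALSE
)

# Declare the order of the categories explicitly.
ords$education <- factor(
  ords$education,
  levels  = c("Primary", "Secondary", "Tertiary"),
  ordered = TRUE
)

levels(ords$education)                            # Primary < Secondary < Tertiary
ords$education >= "Secondary"                     # comparisons respect the order
levels(fct_rev(ords$education))                   # reversed order
levels(fct_relevel(ords$education, "Tertiary"))   # move one level to the front
```

Without the `levels` argument R would sort the categories alphabetically, which rarely matches their substantive order.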

Types of variables (III): Interval

Zero is arbitrary.

  • Year: 2004, 2005, 2008, 2010.
  • Temperature (except Kelvin): 10, 25, 30.
  • Ideology: Left-right measured as 0-10.
  • Coordinates: Longitude and latitude.
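Why an arbitrary zero matters can be seen with the temperatures listed above. In this base R sketch (object names are hypothetical) differences survive a change of scale, but ratios do not.

```r
celsius <- c(10, 25, 30)      # values from the list above
kelvin  <- celsius + 273.15   # same temperatures, different zero point

# Differences are the same on both scales.
diff(celsius)
diff(kelvin)

# Ratios are not: "three times as hot" only holds on the Celsius scale.
celsius[3] / celsius[1]
kelvin[3] / kelvin[1]
```

This is why addition and subtraction are meaningful for interval variables, while multiplication and division require a variable with a meaningful zero.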

Polity V

[1]  3  9 13

Types of variables (IV): Ratio

Zero has meaning: it indicates the absence of the quantity.

  • Age: 2, 5, 7, 9
  • Percentages: 0%, 100%, 34%…
  • Population: 13000000, 200, 33450000

National Material Capabilities (NMC) dataset (Singer 1987; Singer and Small 1972).

Types of variables (V): Summary

Type                 Characteristics               Vector               Operations
Nominal categorical  Unordered categories          Character or factor  ==, !=
Ordinal categorical  Ordered categories            Factor               ==, !=, <=, <, >, >=
Interval numeric     Numbers, zero has no meaning  Numeric or integer   ==, !=, <=, <, >, >=, +, -
Ratio numeric        Numbers, zero has meaning     Numeric              ==, !=, <=, <, >, >=, +, -, *, / …

Recoding variables

Summary recoding

Table 1: Target variable type and the function needed
Target       Function
Binary       if_else()
Categorical  case_when()
Ordinal      factor()
Any          recode()
Other        as.numeric(), as.character(), as.Date(), etc.
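The functions in the table can be sketched together on a toy dataframe. This assumes dplyr is installed. The object names, the `polity2` values and the cut-offs at -6 and 6 are illustrative assumptions, not taken from these slides. The `age` and `urban_rural` values come from the micro-level example, where age is stored as text.

```r
library(dplyr)

# Toy data: polity2 scores are hypothetical.
d <- data.frame(
  polity2     = c(-6, 3, 9, 10),
  age         = c("26", "25", "35", "79"),
  urban_rural = c("Urban", "Rural", "Rural", "Rural"),
  stringsAsFactors = FALSE
)

recoded <- d %>%
  mutate(
    # Binary target: if_else() (the cut-off of 6 is an assumption)
    democracy = if_else(polity2 >= 6, "Democracy", "Other"),
    # Categorical target: case_when() (the cut-offs are assumptions)
    regime = case_when(
      polity2 <= -6 ~ "Autocracy",
      polity2 >= 6  ~ "Democracy",
      TRUE          ~ "Anocracy"
    ),
    # Ordinal target: factor() with explicit, ordered levels
    regime_ord = factor(
      regime,
      levels  = c("Autocracy", "Anocracy", "Democracy"),
      ordered = TRUE
    ),
    # Any target: recode() maps old values to new ones
    urban = recode(urban_rural, "Urban" = "U", "Rural" = "R"),
    # Other: change the vector type
    age_num = as.numeric(age)
  )

recoded
```

Keeping the recoded values in new columns, rather than overwriting the originals, makes it easy to check each recode against its source.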


Bibliography

Singer, J. David. 1987. “Reconstructing the Correlates of War Dataset on Material Capabilities of States, 1816–1985.” International Interactions 14: 115–32.
Singer, J. David, and Melvin Small. 1972. The Wages of War, 1816–1965: A Statistical Handbook. New York: Wiley.
Van Evera, Stephen. 2009. Guía para Estudiantes de Ciencia Política: Métodos y Recursos. Barcelona: Gedisa.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 50 (10): 1–23.